Although numerous studies have been conducted on handwriting recognition, research on Javanese script
recognition remains scarce and suboptimal because it is limited to basic characters. Therefore, this
research proposes a handwritten Javanese script recognition method based on a twelve-layer deep
convolutional neural network (DCNN), consisting of four convolution, two pooling, and five fully
connected (FC) layers, with a SoftMax classifier. Five FC layers are used so that the learning process
proceeds in stages and achieves better learning outcomes. Because the Javanese script dataset contains a
limited number of images, an augmentation process is needed to improve recognition performance. Using
seven types of geometric augmentation and the proposed DCNN model, this method obtained 99.65% accuracy
for 120 Javanese script character classes, comprising 20 basic characters plus 100 others formed by
compounding basic characters with vowels.
1. INTRODUCTION 

Indonesia is a country comprising numerous ethnic groups with various languages and cultures. One of
the largest ethnic groups is the Javanese, who use the Javanese language, originally written with the Javanese
script. This language is currently rarely used by this ethnic group and therefore needs to be preserved.
Technology-based learning of the Javanese script is one way to re-popularize the writing of this language.
This research proposes a highly accurate Javanese script recognition method. Many recognition methods have
been proposed. Some are used for Javanese script recognition [1]-[4], and others for non-Latin scripts, such as
Arabic [5]-[7], Tamil [8], Bangla or Bengali [9]-[11], Kannada [12], Gurmukhi [13], Tifinagh [14], and Thai [15].
Non-Latin character recognition is usually more difficult due to limited research and datasets and the
relatively complex shapes of the characters. The study in [16] also shows that certain algorithms have better
accuracy when interpreting Latin characters than Javanese script.

Preliminary studies have been carried out on handwritten Javanese script recognition, such as those by
[4] and [1]-[3], which are based on machine learning and deep learning, respectively. However, the results
obtained are still unsatisfactory because they are limited to basic characters (Carakan). To write a proper
sentence in Javanese script, the basic script (Carakan), vowel scripts (Sandhangan Swara), and consonant scripts
(Sandhangan Panyigeg and Sandhangan Wyanjana), as well as numbers and punctuation, are required. The vowel,
consonant, and basic scripts are used to turn off vocal reading. The vowel and consonant scripts are only used
in the middle of words or sentences. The Javanese script is written from left to right, whereas a sandhangan
may be placed to the left, right, top, or bottom. Figure 1 is an example of a typical script: lines 1 and 2 are
Javanese scripts, while the subsequent one is a basic script compounded with vowels. For this reason, its
recognition is more complicated than that of Latin characters. One of the most accurate studies on Javanese
script recognition was carried out by [1], who proposed a convolutional neural network (CNN) method. This
approach consists of three convolutional and pooling layers, as well as two fully connected layers, yielding a
recognition accuracy of 94.57 percent for 20 basic Javanese scripts.

This study proposes a method to improve the recognition accuracy of Javanese script that is not limited
to the basic script but also covers basic characters compounded with vowels, using a deep convolutional neural
network (DCNN) and data augmentation. Data augmentation is used to enrich the relatively small dataset used in
this research. This manuscript consists of five sections. Section 1 is the introduction. Section 2 presents the
motivation, explains why DCNN and data augmentation are proposed, and covers related research and the
contributions made. Section 3 describes the detailed steps of the proposed method. Sections 4 and 5 present
the results and analysis, including the implementation of the method, and the conclusion, respectively.
Figure 1. Javanese script characters 


2. MOTIVATION AND CONTRIBUTION 

Several studies on handwritten Javanese script recognition have been carried out, including the one
by [3]. It used several artificial neural network methods to recognize 20 basic characters, four vowel
scripts, and seven numbers. The handwritten Javanese script image was first read and converted to
grayscale. Several pre-processing procedures, such as slope detection and correction, were then carried out,
and the image was segmented by thresholding and skeletonizing. After the area of the character had been
obtained, it was divided into 4x5 zones, and feature extraction was carried out on each zone with the image
centroid and zone (ICZ) and zone centroid and zone (ZCZ) methods. This yielded 40 ICZ-ZCZ features, which were
used as input to the ANN classifiers. Several classification methods were used, namely the counter propagation
network (CPN), backpropagation neural network (BPNN), and evolutionary neural network (ENN), as well as a
combination of the Chi2 and BPNN approaches. It was reported that these methods produced the best
classification with an accuracy of 73.71%.
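
The zoning features described above can be illustrated with a minimal sketch. It assumes the common definition of ICZ (mean distance from the image centroid to the foreground pixels of a zone) and ZCZ (mean distance from the zone's own centroid to those pixels); the function name, the 5-row by 4-column reading of "4x5 zones", and the binary nested-list input are assumptions for illustration, not details taken from [3].

```python
import math

def icz_zcz_features(img, rows=5, cols=4):
    """Return 2 * rows * cols zoning features for a binary image (nested lists of 0/1)."""
    h, w = len(img), len(img[0])
    ink = [(y, x) for y in range(h) for x in range(w) if img[y][x]]
    if not ink:
        return [0.0] * (2 * rows * cols)
    # Centroid of all foreground pixels in the whole image.
    cy = sum(y for y, _ in ink) / len(ink)
    cx = sum(x for _, x in ink) / len(ink)
    icz, zcz = [], []
    for r in range(rows):
        for c in range(cols):
            # Zone boundaries (assumed: equal-sized rectangular zones).
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            zone = [(y, x) for y, x in ink if y0 <= y < y1 and x0 <= x < x1]
            if not zone:
                icz.append(0.0)
                zcz.append(0.0)
                continue
            # Centroid of the foreground pixels inside this zone only.
            zy = sum(y for y, _ in zone) / len(zone)
            zx = sum(x for _, x in zone) / len(zone)
            icz.append(sum(math.hypot(y - cy, x - cx) for y, x in zone) / len(zone))
            zcz.append(sum(math.hypot(y - zy, x - zx) for y, x in zone) / len(zone))
    return icz + zcz
```

With 20 zones and two features per zone, the function returns a 40-element vector, which matches the 40 ICZ-ZCZ features mentioned above.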

The research carried out by [4] used the k-nearest neighbor (KNN) classifier combined with
roundness and eccentricity feature extraction to recognize 20 basic Javanese scripts with a relatively small
dataset of 240 images. The proposed method has an accuracy of approximately 87.5%. The
performance of the recognition process is fairly good because the pre-processing stage consists of binarization,
median filtering, and dilation.

In the research carried out by [1], a CNN, one of the deep learning algorithms, was employed to recognize 20
basic Javanese scripts, and 11,000 data samples were used. The proposed method is divided into two models. Model
1 consists of two convolution, three pooling, and one fully connected layer. Model 2 uses the same
layers plus an additional fully connected layer at the end, before the classification process. Each model was
tested with learning rates of 0.006 and 0.01 and regularization values of 0.0005 and 0.0001. It was discovered
that model 2, with a 0.01 learning rate and 0.0005 regularization, had the best performance with an accuracy
of 94.57%.

Based on these preliminary studies, research on Javanese script recognition still has considerable
room for improvement. Recognition is mostly limited to basic characters, and the performance
of the methods still needs to be optimized. For other scripts, such as Tamil and
Bengali, which are likewise abugida writing systems derived from the Brahmi script, the results
tend to be better. In the research carried out by [8], the CNN method was used to recognize handwritten Tamil
characters. A total of 82,929 images were extracted from the online version with linear interpolation and a
constant thickening factor and were normalized by resizing to 64x64. All the images were processed by a CNN
consisting of five convolution, two max-pooling, and fully connected layers. Several
hyperparameters were used in this method, namely initialization = Xavier, batch size = 64, optimizer = Adam,
epochs = 100, learning rate = 0.001, and activation function = rectified linear unit (ReLU). This approach
achieved an accuracy of approximately 97.7% on test data covering 156 handwritten Tamil character classes.

In the research carried out by [10], a method for Bangla character recognition using
DCNN and squeeze-and-excitation (SE)-ResNeXt was proposed. The dataset used is BanglaLekha-Isolated
(Biswas et al. 2017), which consists of 50 basic, 10 numeric, and 24 compound characters. The images in the
dataset range in size from 150x150 to 185x185 and are normalized and resized to 32x32 pixels.
All data are then processed using six layers. The first is a 3x3 convolution block with 64 filters, the second
is SE-ResNeXt Block-1 with 64 filters, and the third is SE-ResNeXt Block-2 with 128 filters.
The fourth, fifth, and sixth are SE-ResNeXt Block-3, global average pooling, and fully connected layers,
respectively. This approach has an accuracy of approximately 99.82%.

Another deep learning recognition method was used to recognize Gurmukhi characters by [13]. This
research combined offline and online learning features to recognize Gurmukhi handwriting. For the offline
data, a pre-trained model was adopted in the learning architecture to classify images consisting of simple
lines with classes. Therefore, only the lower-level layers were used to learn low-level features in the image.
The processed results are passed to two fully connected layers with 512 neurons, 40% dropout, and a ReLU
activation layer. The SoftMax activation layer is used at the output, while the root mean squared propagation
(RMSprop) optimizer was adopted to perform multiclass classification. For the online data, three blocks of CNN
layers are used. The first has two 1D convolution layers with 64 filters and 1D max-pooling. The
second block has two 1D convolution layers with 128 filters and 1D max-pooling. The third has a 1D convolution
with 128 filters and 1D max-pooling. The output of the CNN layers is flattened before passing to a fully
connected layer with 512 neurons and 30% dropout. Like the offline part, the online part also uses the ReLU and
SoftMax activation layers and the RMSprop optimizer. Based on the test results, the best accuracy was
approximately 97.44%, with 90% training and 10% testing data.

Another study on a similar object is [17]. This research employed a combination of the multi
augmentation technique (MAT), adaptive Gaussian thresholding, convolutional autoencoder (AGCA) and CNN
to recognize Balinese script. Augmentation improves recognition performance on a relatively small dataset,
namely 1197 Balinese character images written in the papyrus manuscript, in 18 classes. The MAT-AGCA
method produced 3159 images, consisting of 2835 for training, 216 for validation, and 108 for testing. This
method achieved its highest accuracy of 96.29% with MobileNetV2 as the pre-trained model. The augmentation
model provides high accuracy, with recognition of 40.74%.

Based on the related research, it is concluded that deep learning methods, especially convolutional ones,
have been proven to perform excellently for handwriting recognition on various derivatives of the Brahmi
script. Currently, preliminary studies on the Javanese script are limited to basic characters. Therefore, this
research was carried out to optimize Javanese script recognition accuracy by designing an appropriate DCNN
model. This study recognizes the basic characters and those compounded with vowel scripts, 120 classes in
total. The number of classes is considerably larger than in previous Javanese script recognition research.
Because the dataset used is quite limited, a data augmentation process was carried out to improve learning
performance in this research.


3. PROPOSED METHOD 

This research proposes a recognition method that uses DCNN as the main algorithm. This approach
has been proven to perform well in various image classification processes, especially handwritten,
printed, and digital text recognition, both in modern Latin characters and traditional scripts of various
languages [5], [6], [14], [18]-[24]. Before the convolution and learning processes are carried out, the image
dataset is pre-processed to ensure accuracy, through grayscaling, cropping, negative-image conversion, resizing,
and data augmentation. Data augmentation is carried out to increase the variety of the dataset and
to improve classification accuracy [17], [25]-[29]. Figure 2 shows the method
proposed in this research, which is described in detail in subsections 3.1 to 3.3.
Figure 2. Proposed method 


3.1. Pre-processing 

The image datasets used in this research need to be normalized to improve the classification
performance. Several processes are carried out in the pre-processing stage. The first is conversion to
grayscale. This simplifies the image and reduces computational complexity because the calculations are
carried out on only one layer. Color features matter little for deciphering a handwritten image
because the writing patterns consist only of lines and dots. The text could be any color,
but one that contrasts with the background is recommended. The second is image cropping.
The image is cropped to a square of size NxN, which reduces the empty area around the writing and keeps
its shape unchanged during resizing. The third is converting the image to its negative. This is carried out
because most segmentation processes include a binarization procedure in which the object and its background
are generally converted to 1 (white) and 0 (black), respectively. This concept is also widely applied in
recognition methods that change images to their complementary form [4], [19], [30]. The fourth is
resizing to 64x64 pixels. The aim is to reduce computation, considering that deep learning requires
expensive resources and computations.
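
The four pre-processing steps can be sketched in plain Python on nested lists. This is an illustrative sketch only: the luminance weights, the ink threshold of 128, the centering of the writing in the square crop, and nearest-neighbor resizing are assumptions, since the text does not specify them, and all function names are hypothetical.

```python
def to_gray(rgb):
    # Assumed luminance weights (ITU-R BT.601); the source only says "grayscale".
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row] for row in rgb]

def crop_square(gray, ink_threshold=128, background=255):
    # Find the bounding box of dark (ink) pixels and center it in an N x N canvas.
    h, w = len(gray), len(gray[0])
    coords = [(y, x) for y in range(h) for x in range(w) if gray[y][x] < ink_threshold]
    if not coords:
        return gray
    top, left = min(y for y, _ in coords), min(x for _, x in coords)
    bh = max(y for y, _ in coords) - top + 1
    bw = max(x for _, x in coords) - left + 1
    n = max(bh, bw)
    out = [[background] * n for _ in range(n)]
    oy, ox = (n - bh) // 2, (n - bw) // 2
    for y in range(bh):
        for x in range(bw):
            out[oy + y][ox + x] = gray[top + y][left + x]
    return out

def negative(gray):
    # Complement image: dark ink on a light background becomes light ink on black.
    return [[255 - v for v in row] for row in gray]

def resize_nearest(img, size=64):
    # Nearest-neighbor resize to size x size (interpolation method is an assumption).
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)] for y in range(size)]

def preprocess(rgb, size=64):
    return resize_nearest(negative(crop_square(to_gray(rgb))), size)
```

Cropping to a square before resizing is what keeps the aspect ratio of the character intact when it is scaled to 64x64.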


3.2. Data augmentation 

Data augmentation is widely employed in learning processes such as recognition and classification.
The goal is to enlarge the dataset so that the model learns more, thereby improving accuracy.
It is chiefly carried out in deep learning processes that have relatively small datasets, such as in
[17], [26], [31], [32]. Image augmentation can be performed in various ways, namely rotation, scaling,
width and height shifts, filtering, flipping, stretching, squeezing, affine and projective transformation, gamma
correction, noise injection, and color augmentation [17], [33], [34]. It can also be performed during various
image processing stages. However, in the recognition of handwritten objects, some augmentations are not used
in this research because they can reduce recognition performance and alter the meaning of the
writing, namely noise injection, flipping, and rotation beyond a certain degree.

In addition, noise injection and gamma correction are not used because the image dataset was taken directly
from a smartphone notepad handwriting application, which ensures good quality. The augmentation
processes used in this research are rotation, scaling, affine and projective transformations, and squeezing.
However, these can change the image size, so several normalization
procedures are carried out afterwards. The details of each augmentation process are shown in
Table 1. The t value in Table 1 is the transformation matrix used in the geometric transformation.
Rotation and squeezing each produce two images: rotation because of the two rotation directions,
and squeezing because it is applied to either the width or the height. Therefore, seven types of augmentation
are used in this study.


Table 1. Data augmentation details
Augmentation type   Specification
Affine              2D transform, t = [1 0.3 0; 0.1 1 0; 0 0 1]
Projective          2D transform, t = [1 0 -0.002; 0.3 1 -0.0002; 0 0 1]
Rotation            3° and -3°
Scaling             From 64 pixels to 54 pixels in width and height
Squeezing           From 64 pixels to 44 pixels in width or height
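
To illustrate how the t matrices of Table 1 move pixel coordinates, the following sketch applies them in homogeneous coordinates. It reads each t as a 3x3 matrix and assumes the row-vector convention [x y 1] · t; the source does not state which convention or software was used, so both are assumptions, as are the function names.

```python
# Matrices as read from Table 1 (rows separated by semicolons in the table).
AFFINE_T = [[1, 0.3, 0], [0.1, 1, 0], [0, 0, 1]]
PROJECTIVE_T = [[1, 0, -0.002], [0.3, 1, -0.0002], [0, 0, 1]]

def transform_point(x, y, t):
    # Row-vector convention (assumed): [u v w] = [x y 1] . t, then divide by w.
    u = x * t[0][0] + y * t[1][0] + t[2][0]
    v = x * t[0][1] + y * t[1][1] + t[2][1]
    w = x * t[0][2] + y * t[1][2] + t[2][2]
    return u / w, v / w

def transformed_extent(size, t):
    # Width and height of the bounding box of the four transformed image corners.
    corners = [(0, 0), (size - 1, 0), (0, size - 1), (size - 1, size - 1)]
    pts = [transform_point(x, y, t) for x, y in corners]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return max(xs) - min(xs), max(ys) - min(ys)

if __name__ == "__main__":
    print(transform_point(10, 20, AFFINE_T))
    print(transformed_extent(64, AFFINE_T))
    print(transformed_extent(64, PROJECTIVE_T))
```

Under these assumptions the transformed corners of a 64x64 image span more than 64 pixels in both directions, which is why the augmented images need to be normalized back to a fixed size afterwards.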


3.3. DCNN model 

After the pre-processing and augmentation stages, the dataset is divided into two groups, namely
training and testing data for the DCNN. DCNN is a feature extraction and deep learning method widely used in
image recognition and classification [11], [35], [36]. Various DCNN models have been proposed; the one used
in this study has 12 layers. It consists of four convolution (C) layers, with a max-pooling (MP) layer inserted
after every two convolution layers, followed by five fully connected (FC) layers and a softmax classifier (SC)
layer. The first six layers, namely 2-C, 1-MP, 2-C, and 1-MP, perform feature extraction. The two initial
convolution layers have 16 and 32 filters, respectively, each with a dropout value of 0.2. The MP layer
downsamples the feature map to reduce the output size and control overfitting. Next come two more convolution
layers with 32 and 64 filters and a dropout value of 0.3 each, followed by an MP layer. Each convolution
layer uses the ReLU activation function to perform thresholding, where every value of x less than 0 is
converted to 0, calculated using (1) [14].


f (x) = max (0,x) (1) 
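
As a rough sketch of the feature-extraction stack, the following traces the tensor shape through the six layers and implements (1). The filter counts and dropout values come from the text; the "same" padding for convolutions and the 2x2 pooling window are assumptions, because the kernel size, padding, and pool size are not stated.

```python
def relu(x):
    # Equation (1): f(x) = max(0, x).
    return max(0.0, x)

# (layer kind, number of filters, dropout) for 2-C, 1-MP, 2-C, 1-MP.
FEATURE_LAYERS = [
    ("conv", 16, 0.2),
    ("conv", 32, 0.2),
    ("pool", None, None),
    ("conv", 32, 0.3),
    ("conv", 64, 0.3),
    ("pool", None, None),
]

def trace_shapes(height=64, width=64, channels=1, pool=2):
    """Return (kind, height, width, channels) after each layer."""
    shapes = []
    for kind, filters, _dropout in FEATURE_LAYERS:
        if kind == "conv":
            channels = filters      # assumed "same" padding keeps height and width
        else:
            height //= pool         # assumed 2x2 max-pooling halves each dimension
            width //= pool
        shapes.append((kind, height, width, channels))
    return shapes

if __name__ == "__main__":
    for shape in trace_shapes():
        print(shape)
    _, h, w, c = trace_shapes()[-1]
    print("flattened size:", h * w * c)
```

Under these assumptions the last pooling layer outputs a 16x16x64 feature map, so the vector handed to the first FC layer would have 16,384 elements; a different kernel, padding, or pool size would change these numbers.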


The FC layers process the data to transform and classify it linearly. The
output of the last feature-extraction layer needs to be flattened into one-dimensional data before it can be
entered into an FC layer. Five FC layers are proposed in this research to carry out the learning process in
stages to achieve better outcomes. The FC layers are given dropout values of 0.2, 0.3, 0.3, 0.2, and 0.2,
respectively, which were chosen from the best test results. This is inspired by several studies utilizing
multiple FC layers to maximize learning. In the last stage, the data is entered into the SoftMax classifier to
obtain the recognition results. The SoftMax classifier was selected because it provides more intuitive results
and a good probabilistic interpretation. SoftMax calculates the probabilities for all labels: it takes a vector
of scores and converts it into a vector whose values each lie between zero and one and sum
to one. Additionally, the proposed DCNN model is combined with an
adaptive moment estimation (ADAM) optimizer with a learning rate of 0.001.
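
The SoftMax step described above can be written as a short function. This is a generic sketch of the standard softmax, not the authors' code; subtracting the maximum score is a common numerical-stability step added here for illustration.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    m = max(scores)                               # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.1])
    print(probs, sum(probs))
```

For the proposed model the input would be a vector of 120 scores, one per character class, and the predicted class is the index of the largest probability.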


3.4. Evaluation 

The proposed method was evaluated in several stages. The first compares the recognition
results based on the split ratio between the training and testing data. Three split ratios were used,
namely 70:30, 80:20, and 90:10. After the best ratio was obtained, several optimizers were evaluated to check
that the selected one is the best for the proposed model. Two popular optimizers, namely root mean square
propagation (RMSprop) and stochastic gradient descent (SGD), were compared to ADAM. The last evaluation
was carried out by replacing the classifier with several popular ones and by reducing and increasing the number
of FC layers used, to check that the proposed method uses the most suitable classifier.
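
A minimal sketch of the split-ratio stage is given below. It uses a simple shuffled split with a fixed seed; whether the original experiments shuffled or stratified by class is not stated, so this is an assumption, and `split_dataset` is a hypothetical helper. The dataset size of 3,360 is the figure reported in section 4.

```python
import random

def split_dataset(items, train_percent, seed=0):
    """Shuffle items reproducibly and split them into training and testing lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

if __name__ == "__main__":
    for pct in (70, 80, 90):
        train, test = split_dataset(range(3360), pct)
        print(f"{pct}:{100 - pct} -> {len(train)} training, {len(test)} testing")
```

With 3,360 images, the three ratios give 2,352/1,008, 2,688/672, and 3,024/336 training/testing images, respectively.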


4. RESULTS AND DISCUSSION 

This study employed a private handwritten Javanese script dataset created using a notepad application. The
dataset consists of 120 classes comprising basic characters (Carakan) and those compounded with
vowel scripts (Sandhangan Swara), namely e, é, i, o, and u. The vowel a is already integrated into
the basic character, as shown in Figure 1. The dataset consists of 480 images written with two different
text thickness levels, bolder as in Figure 3(a) and thinner as in Figure 3(b). Next, several
preprocessing steps were carried out, namely cropping, converting to grayscale, converting to a negative image,
and resizing to 64x64, as shown in Figures 4(a) to 4(c).


Figure 3. “Ha” sample character of Javanese script (a) written with a thick line and 
(b) written with a thin line 


Figure 4. Sample pre-processing results (a) cropped image, (b) complement image, and (c) resized image 


After the pre-processing stage, several image augmentations were applied, namely affine and projective
2-dimensional transforms, resizing to 10 pixels smaller, squeezing the width and height, and rotating by 3° and
-3°. It is important to note that some of these processes change the image size, namely the affine and
projective 2-dimensional transforms and rotation. In these cases, the augmented image is resized to 64x64. In
the resize augmentation process, because the size is reduced by 10 pixels, 5 pixels of zero-valued padding are
added above, below, left, and right. In the squeezing of width, the image width is compressed from 64 pixels to
44 pixels, so 10 pixels of padding are added on the right and left. The same is done for the squeezing of
height, except that the compression is applied to the height. Figures 5(a) to 5(g) show sample image
augmentation results.
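
The scaling and squeezing augmentations with zero padding can be sketched as follows. Nearest-neighbor resampling and the helper names are assumptions for illustration; the target sizes (54 and 44 pixels) and the padding widths (5 and 10 pixels) follow the description above.

```python
def resize_nearest(img, new_h, new_w):
    # Nearest-neighbor resampling (the interpolation method is an assumption).
    h, w = len(img), len(img[0])
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)] for y in range(new_h)]

def pad_zero(img, top, bottom, left, right):
    # Surround the image with zero-valued (black) pixels.
    width = len(img[0]) + left + right
    rows = [[0] * left + list(row) + [0] * right for row in img]
    return [[0] * width for _ in range(top)] + rows + [[0] * width for _ in range(bottom)]

def scale_augment(img, target=54):
    # 64x64 -> 54x54, then 5 pixels of padding on every side.
    pad = (len(img) - target) // 2
    return pad_zero(resize_nearest(img, target, target), pad, pad, pad, pad)

def squeeze_width(img, target=44):
    # Width 64 -> 44, then 10 pixels of padding on the left and right.
    pad = (len(img[0]) - target) // 2
    return pad_zero(resize_nearest(img, len(img), target), 0, 0, pad, pad)

def squeeze_height(img, target=44):
    # Height 64 -> 44, then 10 pixels of padding above and below.
    pad = (len(img) - target) // 2
    return pad_zero(resize_nearest(img, target, len(img[0])), pad, pad, 0, 0)
```

Each function returns an image of the original 64x64 size, so the augmented samples can be fed to the network without further resizing.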


Figure 5. Sample augmentation results (a) affine2D, (b) projective2D, (c) resize, (d) rotate 3°, (e) rotate -3°, 
(f) squeezing width, and (g) squeezing height 


With the 480 original images and the seven augmentation types, a dataset of 3,360 images was obtained.
In the next stage, the recognition process is carried out with the DCNN. The proposed method, as shown in
Figure 2, involves several tuned hyperparameters. Testing was carried out several times with
different hyperparameter values, as shown in Table 2. In the training and testing processes, the dataset is
divided into two parts, namely training and testing data, with compositions of
70%:30%, 80%:20%, and 90%:10%. The most accurate result was obtained with the 80:20 split
over 100 epochs, as shown in Figure 6.


Table 2. Summary of tuning hyperparameters 

Hyperparameter          Tested values         Optimal value
Input image dimension   64x64                 64x64
Optimizer               Adam, RMSprop, SGD    Adam
Dropout                 0.2, 0.3, 0.4         Combination of 0.2 and 0.3
Activation function     ReLU, Sigmoid         ReLU
Batch size              32x32, 64x64          64x64
Learning rate           0.0001, 0.001, 0.1    0.001

Figure 6(a) shows that the maximum accuracies on the training and testing data are 99.73%
and 99.65%, respectively. Meanwhile, the minimum losses on the training and testing data are
2.01% and 3.1%, respectively, as shown in Figure 6(b). These results indicate that the proposed method is
effective for Javanese script recognition without overfitting. The detailed recognition results based on
the split ratio are shown in Table 3.
Figure 6. Recognition results (a) accuracy and (b) loss 


Table 3. Accuracy and loss recognition results based on split ratio 

Split ratio   Accuracy (%)           Loss (%)
              Training   Testing     Training   Testing
70:30         98.88      98.13       1.31       1.95
80:20         99.73      99.65       0.23       0.14
90:10         99.27      99.48       0.71       0.56
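
One simple way to read Table 3 with respect to overfitting is to compute the gap between training and testing accuracy for each split ratio. The sketch below does only that arithmetic on the accuracy values of Table 3; the dictionary and function names are illustrative.

```python
# (training accuracy %, testing accuracy %) per split ratio, from Table 3.
TABLE_3_ACCURACY = {
    "70:30": (98.88, 98.13),
    "80:20": (99.73, 99.65),
    "90:10": (99.27, 99.48),
}

def accuracy_gaps(results):
    """Return training minus testing accuracy (percentage points) per split ratio."""
    return {ratio: round(train - test, 2) for ratio, (train, test) in results.items()}

if __name__ == "__main__":
    for ratio, gap in accuracy_gaps(TABLE_3_ACCURACY).items():
        print(f"{ratio}: gap = {gap:+.2f} percentage points")
```

The gaps are all below one percentage point (0.75, 0.08, and -0.21), which is consistent with the statement that the model does not overfit.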


Table 3 shows that the 80:20 split ratio gave the most accurate
results. Further analysis was carried out to examine how much the augmented
data influences accuracy. Before data augmentation was used, the method's accuracy was approximately 88.95%.
This is because the dataset is too small, with four samples for each character and a total
of 480 images. In that test, a split ratio of 75:25 was used, representing three training samples and one test
sample per character. It is concluded that enlarging the data significantly improves the recognition accuracy
on this small dataset. As stated in section 3, the ADAM optimizer was used in this research, and further tests
were carried out on two other widely used optimizers, namely RMSprop and SGD. Figure 7 shows that the accuracy
results differ considerably. The same learning rate was not used for all optimizers: ADAM and RMSprop used a
learning rate of 0.001, while SGD used 0.1. Each learning rate was selected as the best of the trial values
0.0001, 0.001, and 0.1. Further comparisons were made to check that the proposed method performs
well by changing the classifier to a support vector machine (SVM), random forest (RF), or multilayer
perceptron (MLP), and by modifying the layers used. These recognition experiments
test the effectiveness of CNN feature extraction and compare classifier performance.
Table 4 shows the comparison of the recognition results.
Figure 7. Comparison of optimizer used 


Table 4. Comparison of recognition results with different methods 

Method                                                                        Accuracy (%)
Random Forest (RF)                                                            85.00
Multilayer Perceptron (MLP)                                                   81.00
Support Vector Machine (SVM)                                                  88.00
Proposed Model with 1 FC Layer (500 Neurons)                                  95.23
Proposed Model with 3 FC Layers (500, 300, and 100 Neurons)                   96.35
Proposed Model with 7 FC Layers (500, 400, 300, 200, 150, 100, 50 Neurons)    98.75
Proposed Model without Data Augmentation                                      88.95
Proposed Model with Data Augmentation                                         99.65
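
The effect of augmentation reported in Table 4 can be expressed both as a gain in percentage points and as a reduction in error rate. The sketch below is plain arithmetic on the two reported accuracies; note that, according to the text, the two runs used different split ratios (75:25 without augmentation and 80:20 with it), so the comparison is indicative only. The function name is illustrative.

```python
def augmentation_gain(acc_without, acc_with):
    """Return (gain in percentage points, factor by which the error rate shrinks)."""
    points = acc_with - acc_without
    error_factor = (100 - acc_without) / (100 - acc_with)
    return points, error_factor

if __name__ == "__main__":
    points, factor = augmentation_gain(88.95, 99.65)
    print(f"gain: {points:.2f} percentage points, error reduced {factor:.1f}x")
```

With the values from Table 4, the gain is about 10.7 percentage points, and the error rate drops from 11.05% to 0.35%.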


It should be noted that, apart from the row without data augmentation, the comparison results in Table 4 were
all obtained using augmented data with the same split ratio of 80:20. The proposed model had the best accuracy.
This is because more FC layers are used to smooth the learning stage. In addition, dropout at each FC layer
tends to reduce overfitting and improve recognition accuracy. Afterwards, some commonly used CNN models, namely
AlexNet with five convolutional layers and VGGNet-16, were also compared, because these CNN models are
effectively used for various classification processes. Based on Table 5, the proposed method has similar
accuracy to these two CNN models, while its training process is much faster. This shows that it has an
excellent performance. Additionally, the results obtained using the proposed method are compared with several
previous studies on handwritten Javanese script recognition in Table 6. Based on these results, few methods
use datasets with 120 classes. The accuracy obtained by this method is better, which is influenced by the
combination of deep learning and augmentation methods. Compared with the other 120-class method [37], the best
accuracy is obtained with a smaller dataset.


Table 5. Comparison of recognition results with different methods 

CNN model    Accuracy (%)   Training time (seconds)
AlexNet      99.63          665
VGGNet-16    99.79          1033
Ours         99.65          457


Table 6. Comparison of recognition results with previous research 

Method        Number of dataset classes   Total dataset records   Accuracy (%)
Method [2]    20                          2470                    70.22
Method [16]   20                          2000                    80.65
Method [4]    20                          240                     87.50
Method [1]    20                          11500                   94.57
Method [3]    31                          620                     98.00
Method [37]   120                         5880                    97.50
Ours          120                         3360                    99.65


5. CONCLUSION 
Based on the test results in this research, the proposed method works effectively for
recognizing Javanese script. Basic characters compounded with vowel scripts, totalling 120
classes, were investigated. This is a notable advance because prior research rarely achieves high accuracy
on Javanese script with many classes that include compound characters. The proposed
method achieved an accuracy of 99.65%. The data augmentation process was also shown to improve recognition
accuracy significantly, by approximately 10%. This shows that it plays an essential role in recognition with
small datasets. Future research should address more complex datasets that include consonant scripts to improve
accuracy further.